A Very Efficient Scheme for Estimating Entropy of Data Streams Using Compressed Counting

Author

  • Ping Li

Abstract

Compressed Counting (CC) was recently proposed for approximating the αth frequency moments of data streams, for 0 < α ≤ 2. Under the relaxed strict-Turnstile model, CC dramatically improves on the standard algorithm based on symmetric stable random projections, especially as α → 1. A direct application of CC is entropy estimation; entropy is an important summary statistic in Web/network measurement and often serves as a crucial “feature” for data mining. The Rényi entropy and the Tsallis entropy are both functions of the αth frequency moments, and both approach the Shannon entropy as α → 1. A recent theoretical work suggested using the αth frequency moment to approximate the Shannon entropy with α = 1+δ and very small |δ| (e.g., < 10⁻⁴). In this study, we use CC to estimate frequency moments, Rényi entropy, Tsallis entropy, and Shannon entropy on real Web crawl data. We demonstrate the variance-bias trade-off in estimating Shannon entropy and provide practical recommendations. In particular, our experiments enable us to draw some important conclusions:
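To make the relationship between frequency moments and the three entropies concrete, here is a minimal Python sketch. It computes the moments exactly from item counts rather than estimating them with CC, so it illustrates only the definitions and the α → 1 limit, not the paper's estimator. The function names and the toy stream are illustrative assumptions.

```python
import math
from collections import Counter


def frequency_moment(counts, alpha):
    """Exact alpha-th frequency moment: sum of count**alpha over items."""
    return sum(c ** alpha for c in counts if c > 0)


def shannon_entropy(counts):
    """Shannon entropy (in nats) of the empirical distribution."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def renyi_entropy(counts, alpha):
    """Renyi entropy: log(F_alpha / F_1**alpha) / (1 - alpha), alpha != 1."""
    f1 = frequency_moment(counts, 1)
    return math.log(frequency_moment(counts, alpha) / f1 ** alpha) / (1 - alpha)


def tsallis_entropy(counts, alpha):
    """Tsallis entropy: (1 - F_alpha / F_1**alpha) / (alpha - 1), alpha != 1."""
    f1 = frequency_moment(counts, 1)
    return (1 - frequency_moment(counts, alpha) / f1 ** alpha) / (alpha - 1)


# Toy stream (an assumption for illustration, not the paper's Web crawl data).
counts = list(Counter("abracadabra").values())
h = shannon_entropy(counts)

for delta in (0.5, 0.1, 0.01, 0.001):
    alpha = 1 + delta
    print(f"alpha={alpha:<6} "
          f"renyi_gap={h - renyi_entropy(counts, alpha):.6f} "
          f"tsallis_gap={h - tsallis_entropy(counts, alpha):.6f}")
```

Both gaps shrink toward zero as δ decreases, which is the bias side of the trade-off. With estimated rather than exact moments, the division by α − 1 also magnifies estimation noise, which is the variance side.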

Similar resources

Estimating Entropy of Data Streams Using Compressed Counting

The Shannon entropy is a widely used summary statistic, for example in network traffic measurement, anomaly detection, neural computations, and spike trains. This study focuses on estimating the Shannon entropy of data streams. It is known that Shannon entropy can be approximated by Rényi entropy or Tsallis entropy, which are both functions of the αth frequency moments and approach Shannon entropy a...

Improving Compressed Counting

Compressed Counting (CC) [22] was recently proposed for estimating the αth frequency moments of data streams, where 0 < α ≤ 2. CC can be used for estimating Shannon entropy, which can be approximated by certain functions of the αth frequency moments as α → 1. Monitoring Shannon entropy for anomaly detection (e.g., DDoS attacks) in large networks is an important task. This paper presents a new a...

Entropy Estimations Using Correlated Symmetric Stable Random Projections

Methods for efficiently estimating the Shannon entropy of data streams have important applications in learning, data mining, and network anomaly detection (e.g., DDoS attacks). For nonnegative data streams, the method of Compressed Counting (CC) [11, 13], based on maximally-skewed stable random projections, can provide accurate estimates of the Shannon entropy using small storage. However, CC is...

A New Algorithm for Compressed Counting with Applications in Shannon Entropy Estimation in Dynamic Data

Efficient estimation of the moments and Shannon entropy of data streams is an important task in modern machine learning and data mining. To estimate the Shannon entropy, it suffices to accurately estimate the α-th moment with ∆ = |1 − α| ≈ 0. To guarantee that the error of estimated Shannon entropy is within a ν-additive factor, the method of symmetric stable random projections requires O ( 1 ν...
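A short worked step, offered as a sketch rather than as the cited paper's own derivation, shows why the required moment accuracy tightens as ∆ → 0. It uses the Tsallis entropy as the approximation to Shannon entropy and assumes a hypothetical estimate with relative error ε; the symbols F_(α), T_α, and ε are introduced here for illustration.

```latex
% Tsallis entropy written in terms of frequency moments F_{(\alpha)} = \sum_i f_i^{\alpha}:
T_\alpha = \frac{1}{\alpha - 1}\left(1 - \frac{F_{(\alpha)}}{F_{(1)}^{\alpha}}\right)

% Assume an estimate with relative error \epsilon:
% \hat{F}_{(\alpha)} = F_{(\alpha)}(1 + \epsilon), with F_{(1)} known exactly.
\hat{T}_\alpha - T_\alpha
  = -\frac{\epsilon}{\alpha - 1}\cdot\frac{F_{(\alpha)}}{F_{(1)}^{\alpha}}

% Near \alpha = 1 the ratio F_{(\alpha)} / F_{(1)}^{\alpha} is close to 1, so
\left|\hat{T}_\alpha - T_\alpha\right| \approx \frac{|\epsilon|}{\Delta},
\qquad \Delta = |1 - \alpha|

% An additive error of at most \nu therefore needs roughly |\epsilon| \lesssim \nu\,\Delta.
```

Under these assumptions, a relative error of order ν∆ in the moment is needed for a ν-additive error in the entropy, which is why estimators whose variance shrinks as α → 1 matter in this setting.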

On the Sample Complexity of Compressed Counting

The problem of “scaling up for high dimensional data and high speed data streams” is among the “ten challenging problems in data mining research” [36]. This paper is devoted to estimating the entropy of data streams. Mining data streams [19, 4, 1, 29] in databases at the scale of (e.g.) 100 TB has become an important area of research, e.g., [10, 1], as network data can easily reach that scale [36]. Search engi...

Journal:
  • CoRR

Volume abs/0808.1771

Publication date: 2008